Jaccard Score (Jaccard Similarity / Intersection-over-Union)#
The Jaccard score measures similarity between two sets:

\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]
In ML, you’ll often see the same idea as Intersection-over-Union (IoU) for binary masks.
Goals#
Build intuition for intersection vs union (and why true negatives don’t matter).
Derive the classification form: \(\displaystyle \frac{TP}{TP+FP+FN}\).
Implement Jaccard from scratch in NumPy (binary, multiclass, multilabel).
Use Plotly to visualize how thresholds and errors change the score.
Optimize a tiny logistic regression model with a differentiable soft Jaccard loss.
Quick import (scikit-learn)#
from sklearn.metrics import jaccard_score
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
versions = {
'numpy': np.__version__,
'plotly': __import__('plotly').__version__,
}
try:
import sklearn
versions['sklearn'] = sklearn.__version__
except Exception:
versions['sklearn'] = None
versions
{'numpy': '1.26.2', 'plotly': '6.5.2', 'sklearn': '1.6.0'}
Prerequisites & notation#
Binary labels: \(y \in \{0,1\}^n\)
Predicted labels: \(\hat{y} \in \{0,1\}^n\)
Predicted probabilities: \(p \in [0,1]^n\)
Confusion counts: \(TP\), \(FP\), \(FN\), \(TN\)
We’ll interpret the “positive set” as the indices where a vector equals 1: \(A = \{ i : y_i = 1 \}\) and \(B = \{ i : \hat{y}_i = 1 \}\).
1) Set intuition#
Think of two sets:
\(A\): the “true” items
\(B\): the “predicted” items
The Jaccard score is:

\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]

Numerator: what both agree on (overlap)
Denominator: everything that appears in either (coverage)
So Jaccard is high only when the overlap is large and the union isn’t bloated by extras.
A = {1, 2, 3, 5, 8}
B = {2, 3, 4, 8, 9}
intersection = A & B
union = A | B
jaccard = len(intersection) / len(union)
A, B, intersection, union, jaccard
({1, 2, 3, 5, 8},
{2, 3, 4, 8, 9},
{2, 3, 8},
{1, 2, 3, 4, 5, 8, 9},
0.42857142857142855)
universe = np.arange(0, 10)
A_mask = np.isin(universe, sorted(A))
B_mask = np.isin(universe, sorted(B))
# 0: neither, 1: A only, 2: B only, 3: both
cat = A_mask.astype(int) + 2 * B_mask.astype(int)
colorscale = [
[0.00, '#ffffff'],
[0.249999, '#ffffff'], # neither
[0.25, '#ff7f0e'],
[0.499999, '#ff7f0e'], # A only
[0.50, '#1f77b4'],
[0.749999, '#1f77b4'], # B only
[0.75, '#2ca02c'],
[1.00, '#2ca02c'], # both (intersection)
]
fig = go.Figure(
data=go.Heatmap(
z=cat[np.newaxis, :],
x=universe,
y=['elements'],
colorscale=colorscale,
zmin=-0.5,
zmax=3.5,
colorbar=dict(
title='category',
tickmode='array',
tickvals=[0, 1, 2, 3],
ticktext=['neither', 'A only', 'B only', 'A ∩ B'],
),
hovertemplate='element=%{x}<br>category=%{z}<extra></extra>',
)
)
fig.update_layout(
title=f'Jaccard = |A ∩ B| / |A ∪ B| = {len(intersection)}/{len(union)} = {jaccard:.3f}',
height=220,
margin=dict(l=20, r=20, t=60, b=20),
)
fig.show()
2) Binary classification view (TP / FP / FN)#
For binary classification, focus on the positive class:
\(A = \{ i : y_i = 1 \}\) (the actual positives)
\(B = \{ i : \hat{y}_i = 1 \}\) (the predicted positives)
Then:
\(|A \cap B| = TP\)
\(|A \cup B| = TP + FP + FN\)
So the Jaccard score becomes:

\[
J = \frac{TP}{TP + FP + FN}
\]
Notice what’s missing: true negatives \(TN\). If your dataset has tons of negatives, accuracy can look great while Jaccard stays low.
def confusion_counts_binary(y_true, y_pred):
y_true = np.asarray(y_true).astype(bool)
y_pred = np.asarray(y_pred).astype(bool)
tp = np.logical_and(y_true, y_pred).sum()
fp = np.logical_and(~y_true, y_pred).sum()
fn = np.logical_and(y_true, ~y_pred).sum()
tn = np.logical_and(~y_true, ~y_pred).sum()
return int(tp), int(fp), int(fn), int(tn)
def jaccard_score_binary(y_true, y_pred, *, zero_division=0.0):
tp, fp, fn, _ = confusion_counts_binary(y_true, y_pred)
denom = tp + fp + fn
if denom == 0:
return float(zero_division)
return tp / denom
def accuracy_score_binary(y_true, y_pred):
y_true = np.asarray(y_true).astype(int)
y_pred = np.asarray(y_pred).astype(int)
return (y_true == y_pred).mean()
# quick sanity check
y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1])
tp, fp, fn, tn = confusion_counts_binary(y_true, y_pred)
(tp, fp, fn, tn), jaccard_score_binary(y_true, y_pred), accuracy_score_binary(y_true, y_pred)
((2, 1, 1, 2), 0.5, 0.6666666666666666)
2.1 IoU for segmentation (same formula)#
If \(y\) and \(\hat{y}\) are binary masks (pixels in/out of an object), then:
intersection = pixels correctly predicted as object
union = pixels that are object in either mask
So IoU = Jaccard on the set of “object pixels”.
h, w = 40, 40
yy, xx = np.mgrid[0:h, 0:w]
def circle_mask(*, cx, cy, r):
return (xx - cx) ** 2 + (yy - cy) ** 2 <= r**2
true_mask = circle_mask(cx=14, cy=20, r=10)
pred_mask = circle_mask(cx=18, cy=20, r=10)
# 0: background, 1: true-only (FN), 2: pred-only (FP), 3: overlap (TP)
cat = true_mask.astype(int) + 2 * pred_mask.astype(int)
iou = jaccard_score_binary(true_mask.ravel(), pred_mask.ravel(), zero_division=1.0)
colorscale = [
[0.00, '#ffffff'],
[0.249999, '#ffffff'],
[0.25, '#d62728'],
[0.499999, '#d62728'], # true-only (red)
[0.50, '#1f77b4'],
[0.749999, '#1f77b4'], # pred-only (blue)
[0.75, '#2ca02c'],
[1.00, '#2ca02c'], # overlap (green)
]
fig = go.Figure(
data=go.Heatmap(
z=cat,
colorscale=colorscale,
zmin=-0.5,
zmax=3.5,
showscale=True,
colorbar=dict(
title='pixel',
tickmode='array',
tickvals=[0, 1, 2, 3],
ticktext=['background', 'true only (FN)', 'pred only (FP)', 'overlap (TP)'],
),
hovertemplate='x=%{x}<br>y=%{y}<br>category=%{z}<extra></extra>',
)
)
fig.update_layout(
title=f'IoU (Jaccard) on a toy mask: {iou:.3f}',
width=520,
height=520,
yaxis=dict(scaleanchor='x', autorange='reversed'),
margin=dict(l=20, r=20, t=60, b=20),
)
fig.show()
2.2 Why true negatives don’t matter#
Hold \(TP\), \(FP\), \(FN\) fixed and add more and more true negatives.
Accuracy goes up (because it counts \(TN\)).
Jaccard stays exactly the same (because it ignores \(TN\)).
tp, fp, fn = 10, 5, 5
y_true_core = np.array([1] * tp + [1] * fn + [0] * fp, dtype=int)
y_pred_core = np.array([1] * tp + [0] * fn + [1] * fp, dtype=int)
tn_sizes = np.arange(0, 2001, 100)
accs = []
jaccs = []
for tn in tn_sizes:
y_true_full = np.concatenate([y_true_core, np.zeros(tn, dtype=int)])
y_pred_full = np.concatenate([y_pred_core, np.zeros(tn, dtype=int)])
accs.append(accuracy_score_binary(y_true_full, y_pred_full))
jaccs.append(jaccard_score_binary(y_true_full, y_pred_full))
fig = go.Figure()
fig.add_trace(go.Scatter(x=tn_sizes, y=accs, mode='lines+markers', name='accuracy'))
fig.add_trace(go.Scatter(x=tn_sizes, y=jaccs, mode='lines+markers', name='jaccard'))
fig.update_layout(
title=f'Add more TN with TP={tp}, FP={fp}, FN={fn}: Jaccard stays constant',
xaxis_title='number of added true negatives (TN)',
yaxis_title='score',
yaxis=dict(range=[0, 1]),
)
fig.show()
3) How FP and FN move Jaccard#
For fixed \(TP\), Jaccard shrinks as you add either false positives or false negatives:
TP = 10
FP_vals = np.arange(0, 31)
FN_vals = np.arange(0, 31)
Z = np.zeros((len(FN_vals), len(FP_vals)), dtype=float)
for i, fn in enumerate(FN_vals):
for j, fp in enumerate(FP_vals):
Z[i, j] = TP / (TP + fp + fn)
fig = px.imshow(
Z,
x=FP_vals,
y=FN_vals,
origin='lower',
aspect='auto',
labels={'x': 'FP', 'y': 'FN', 'color': 'Jaccard'},
title=f'Jaccard for fixed TP={TP}',
)
fig.show()
4) Relationship to precision/recall/F1#
Precision: \(\displaystyle P = \frac{TP}{TP+FP}\)
Recall: \(\displaystyle R = \frac{TP}{TP+FN}\)
F1: \(\displaystyle F_1 = \frac{2TP}{2TP+FP+FN}\)
Jaccard uses the same ingredients but with a different denominator:

\[
J = \frac{TP}{TP + FP + FN}
\]

A useful identity links Jaccard and F1:

\[
J = \frac{F_1}{2 - F_1} \quad\Longleftrightarrow\quad F_1 = \frac{2J}{1 + J}
\]
f1 = np.linspace(0, 1, 501)
j_from_f1 = f1 / (2 - f1)
fig = go.Figure()
fig.add_trace(go.Scatter(x=f1, y=j_from_f1, mode='lines', name='J = F1/(2-F1)'))
fig.update_layout(
title='Mapping between F1 and Jaccard',
xaxis_title='F1',
yaxis_title='Jaccard',
xaxis=dict(range=[0, 1]),
yaxis=dict(range=[0, 1]),
)
fig.show()
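A quick numeric check of the identity, using hypothetical confusion counts (any positive `tp_`, `fp_`, `fn_` work):

```python
# Hypothetical counts for illustration.
tp_, fp_, fn_ = 30, 10, 5

precision = tp_ / (tp_ + fp_)
recall = tp_ / (tp_ + fn_)
f1_val = 2 * tp_ / (2 * tp_ + fp_ + fn_)
jac = tp_ / (tp_ + fp_ + fn_)

# Identity: J = F1 / (2 - F1), equivalently F1 = 2J / (1 + J)
assert abs(jac - f1_val / (2 - f1_val)) < 1e-12
assert abs(f1_val - 2 * jac / (1 + jac)) < 1e-12

precision, recall, f1_val, jac
```

Because \(x \mapsto x/(2-x)\) is increasing on \([0,1]\), Jaccard and F1 always rank predictions the same way; they just compress the scale differently.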
5) Multilabel and multiclass#
Multilabel#
Each sample can have multiple positive labels. If \(y, \hat{y} \in \{0,1\}^{n\times L}\), you can compute Jaccard:
per-sample and average (samples)
per-label and average (macro)
globally over all entries (micro)
Multiclass#
With mutually exclusive classes, a common definition is to compute Jaccard one-vs-rest per class and then average.
This matches how sklearn.metrics.jaccard_score generalizes Jaccard when average != 'binary'.
def _safe_divide(num, den, *, zero_division=0.0):
num = np.asarray(num, dtype=float)
den = np.asarray(den, dtype=float)
out = np.full_like(num, float(zero_division), dtype=float)
mask = den != 0
out[mask] = num[mask] / den[mask]
return out
def jaccard_score_multilabel(y_true, y_pred, *, average='samples', zero_division=0.0):
y_true = np.asarray(y_true).astype(bool)
y_pred = np.asarray(y_pred).astype(bool)
if y_true.ndim != 2:
raise ValueError('Expected y_true with shape (n_samples, n_labels)')
if y_pred.shape != y_true.shape:
raise ValueError('y_pred must have the same shape as y_true')
if average == 'micro':
inter = np.logical_and(y_true, y_pred).sum()
uni = np.logical_or(y_true, y_pred).sum()
return float(_safe_divide(inter, uni, zero_division=zero_division))
if average in (None, 'none'):
inter_l = np.logical_and(y_true, y_pred).sum(axis=0)
uni_l = np.logical_or(y_true, y_pred).sum(axis=0)
return _safe_divide(inter_l, uni_l, zero_division=zero_division)
if average in ('macro', 'weighted'):
inter_l = np.logical_and(y_true, y_pred).sum(axis=0)
uni_l = np.logical_or(y_true, y_pred).sum(axis=0)
label_scores = _safe_divide(inter_l, uni_l, zero_division=zero_division)
if average == 'macro':
return float(label_scores.mean())
supports = y_true.sum(axis=0)
if supports.sum() == 0:
return float(zero_division)
return float(np.average(label_scores, weights=supports))
if average == 'samples':
inter_s = np.logical_and(y_true, y_pred).sum(axis=1)
uni_s = np.logical_or(y_true, y_pred).sum(axis=1)
sample_scores = _safe_divide(inter_s, uni_s, zero_division=zero_division)
return float(sample_scores.mean())
raise ValueError("average must be one of {'samples','micro','macro','weighted',None}")
def jaccard_score_multiclass(y_true, y_pred, *, average='macro', labels=None, zero_division=0.0):
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
if y_true.ndim != 1 or y_pred.ndim != 1:
raise ValueError('Expected 1D label arrays')
if y_pred.shape != y_true.shape:
raise ValueError('y_pred must have the same shape as y_true')
if len(y_true) == 0:
return float(zero_division)
if labels is None:
labels = np.unique(np.concatenate([y_true, y_pred]))
scores = []
supports = []
for lab in labels:
t = y_true == lab
p = y_pred == lab
tp = np.logical_and(t, p).sum()
fp = np.logical_and(~t, p).sum()
fn = np.logical_and(t, ~p).sum()
denom = tp + fp + fn
score = float(zero_division) if denom == 0 else float(tp / denom)
scores.append(score)
supports.append(t.sum())
scores = np.asarray(scores, dtype=float)
supports = np.asarray(supports, dtype=float)
if average == 'macro':
return float(scores.mean())
if average == 'weighted':
if supports.sum() == 0:
return float(zero_division)
return float(np.average(scores, weights=supports))
if average == 'micro':
correct = (y_true == y_pred).sum()
union = 2 * len(y_true) - correct
return float(zero_division) if union == 0 else float(correct / union)
if average in (None, 'none'):
return scores
raise ValueError("average must be one of {'micro','macro','weighted',None}")
# examples
y_true_ml = np.array(
[
[1, 0, 1],
[0, 1, 0],
[1, 1, 0],
[0, 0, 0],
],
dtype=int,
)
y_pred_ml = np.array(
[
[1, 1, 1],
[0, 1, 0],
[0, 1, 0],
[0, 0, 0],
],
dtype=int,
)
scores = {
'samples': jaccard_score_multilabel(y_true_ml, y_pred_ml, average='samples', zero_division=1.0),
'micro': jaccard_score_multilabel(y_true_ml, y_pred_ml, average='micro', zero_division=1.0),
'macro': jaccard_score_multilabel(y_true_ml, y_pred_ml, average='macro', zero_division=1.0),
'weighted': jaccard_score_multilabel(y_true_ml, y_pred_ml, average='weighted', zero_division=1.0),
'per-label': jaccard_score_multilabel(y_true_ml, y_pred_ml, average=None, zero_division=1.0),
}
y_true_mc = np.array([0, 1, 2, 2, 1, 0])
y_pred_mc = np.array([0, 2, 2, 1, 1, 0])
scores, {
'multiclass_macro': jaccard_score_multiclass(y_true_mc, y_pred_mc, average='macro'),
'multiclass_micro': jaccard_score_multiclass(y_true_mc, y_pred_mc, average='micro'),
'multiclass_per_class': jaccard_score_multiclass(y_true_mc, y_pred_mc, average=None),
}
({'samples': 0.7916666666666666,
'micro': 0.6666666666666666,
'macro': 0.7222222222222222,
'weighted': 0.6666666666666666,
'per-label': array([0.5 , 0.6667, 1. ])},
{'multiclass_macro': 0.5555555555555555,
'multiclass_micro': 0.5,
'multiclass_per_class': array([1. , 0.3333, 0.3333])})
try:
from sklearn.metrics import jaccard_score as sk_jaccard_score
print('Binary (sklearn):', sk_jaccard_score(y_true, y_pred))
print('Binary (ours): ', jaccard_score_binary(y_true, y_pred))
print('Multilabel macro (sklearn):', sk_jaccard_score(y_true_ml, y_pred_ml, average='macro', zero_division=1.0))
print('Multilabel macro (ours): ', jaccard_score_multilabel(y_true_ml, y_pred_ml, average='macro', zero_division=1.0))
print('Multiclass macro (sklearn):', sk_jaccard_score(y_true_mc, y_pred_mc, average='macro'))
print('Multiclass macro (ours): ', jaccard_score_multiclass(y_true_mc, y_pred_mc, average='macro'))
except Exception as e:
print('sklearn not available:', e)
Binary (sklearn): 0.5
Binary (ours): 0.5
Multilabel macro (sklearn): 0.7222222222222222
Multilabel macro (ours): 0.7222222222222222
Multiclass macro (sklearn): 0.5555555555555555
Multiclass macro (ours): 0.5555555555555555
6) Thresholding probabilities#
Jaccard is defined on sets / hard labels. If your model outputs probabilities \(p\), you typically choose a threshold \(t\) and set:

\[
\hat{y}_i = \mathbf{1}[\, p_i \ge t \,]
\]
Different thresholds trade off \(FP\) vs \(FN\), so they can change Jaccard a lot.
n = 400
y_true_thr = rng.binomial(1, 0.15, size=n)
# simulate a "model score": positives tend to have higher logits
logits = rng.normal(loc=0.0, scale=1.0, size=n) + 1.5 * y_true_thr
p_thr = 1 / (1 + np.exp(-logits))
thresholds = np.linspace(0.0, 1.0, 201)
j_scores = np.array(
[jaccard_score_binary(y_true_thr, (p_thr >= t).astype(int), zero_division=0.0) for t in thresholds]
)
best_idx = int(j_scores.argmax())
best_t = float(thresholds[best_idx])
best_j = float(j_scores[best_idx])
fig = px.line(
x=thresholds,
y=j_scores,
labels={'x': 'threshold', 'y': 'Jaccard'},
title=f'Jaccard vs threshold (best t≈{best_t:.2f}, J≈{best_j:.3f})',
)
fig.add_vline(x=best_t, line_dash='dash', line_color='black')
fig.update_layout(yaxis=dict(range=[0, 1]))
fig.show()
7) Using Jaccard in optimization: a soft Jaccard loss#
The “hard” Jaccard score uses discrete predictions, so it’s not differentiable w.r.t. model parameters.
A common trick (especially in segmentation) is to replace hard predictions with probabilities \(p\):
Soft intersection: \(I = \sum_i y_i p_i\)
Soft union: \(U = \sum_i y_i + \sum_i p_i - \sum_i y_i p_i\)
Soft Jaccard:

\[
J_{\text{soft}} = \frac{I + \varepsilon}{U + \varepsilon}
\]

Soft Jaccard loss:

\[
\mathcal{L} = 1 - J_{\text{soft}}
\]

Gradient w.r.t. a probability \(p_i\) (ignoring \(\varepsilon\), and using \(\partial I / \partial p_i = y_i\), \(\partial U / \partial p_i = 1 - y_i\)):

\[
\frac{\partial \mathcal{L}}{\partial p_i} = -\,\frac{y_i\,U - I\,(1 - y_i)}{U^2}
\]
Then use the chain rule for logistic regression, where \(p_i = \sigma(x_i^\top w)\).
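Before trusting an analytic gradient, it's worth verifying it against central finite differences. Here is a standalone sanity check (the helper names with trailing underscores are just for this sketch; they mirror the soft Jaccard definitions above):

```python
import numpy as np

rng_check = np.random.default_rng(0)

def soft_jaccard_loss_(y, p, eps=1e-12):
    I = np.sum(y * p)
    U = np.sum(y) + np.sum(p) - I
    return 1.0 - (I + eps) / (U + eps)

def soft_jaccard_grad_(y, p):
    # analytic gradient from the formula above (eps omitted for clarity)
    I = np.sum(y * p)
    U = np.sum(y) + np.sum(p) - I
    return -(y * U - I * (1 - y)) / U**2

y_chk = rng_check.binomial(1, 0.3, size=20).astype(float)
p_chk = rng_check.uniform(0.05, 0.95, size=20)

analytic = soft_jaccard_grad_(y_chk, p_chk)

# central finite differences, one coordinate at a time
h = 1e-6
numeric = np.zeros_like(p_chk)
for i in range(len(p_chk)):
    p_plus, p_minus = p_chk.copy(), p_chk.copy()
    p_plus[i] += h
    p_minus[i] -= h
    numeric[i] = (soft_jaccard_loss_(y_chk, p_plus) - soft_jaccard_loss_(y_chk, p_minus)) / (2 * h)

assert np.allclose(analytic, numeric, atol=1e-6)
```

If the assertion passes, the analytic gradient agrees with the numerical one to within finite-difference error, and we can safely chain it through the sigmoid.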
# Synthetic 2D binary classification (imbalanced)
n0, n1 = 900, 100
X0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 1.0], size=(n0, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=[1.0, 1.0], size=(n1, 2))
X = np.vstack([X0, X1])
y = np.array([0] * n0 + [1] * n1, dtype=int)
# shuffle
perm = rng.permutation(len(y))
X = X[perm]
y = y[perm]
# train/test split (pure NumPy)
test_size = 0.30
n_test = int(len(y) * test_size)
X_test = X[:n_test]
y_test = y[:n_test]
X_train = X[n_test:]
y_train = y[n_test:]
# standardize (fit on train)
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-12
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma
# add bias column
Xb_train = np.c_[np.ones(len(y_train)), X_train_s]
Xb_test = np.c_[np.ones(len(y_test)), X_test_s]
fig = px.scatter(
x=X_train_s[:, 0],
y=X_train_s[:, 1],
color=y_train.astype(str),
title='Training data (standardized)',
labels={'color': 'class'},
)
fig.show()
Xb_train.shape, Xb_test.shape, float(y_train.mean()), float(y_test.mean())
((700, 3), (300, 3), 0.10285714285714286, 0.09333333333333334)
def sigmoid(z):
z = np.clip(z, -60, 60)
return 1 / (1 + np.exp(-z))
def log_loss(y, p, *, eps=1e-12):
y = np.asarray(y, dtype=float)
p = np.asarray(p, dtype=float)
p = np.clip(p, eps, 1 - eps)
return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
def soft_jaccard_loss(y, p, *, eps=1e-12):
y = np.asarray(y, dtype=float)
p = np.asarray(p, dtype=float)
I = np.sum(y * p)
U = np.sum(y) + np.sum(p) - I
return float(1.0 - (I + eps) / (U + eps))
def soft_jaccard_grad_p(y, p, *, eps=1e-12):
y = np.asarray(y, dtype=float)
p = np.asarray(p, dtype=float)
I = np.sum(y * p)
U = np.sum(y) + np.sum(p) - I
Ieps = I + eps
Ueps = U + eps
dJdp = (y * Ueps - Ieps * (1 - y)) / (Ueps**2)
return -dJdp
def fit_logreg_gd(Xb, y, *, loss='log', lr=0.1, n_iter=400, l2=0.0, record_every=5):
y = np.asarray(y, dtype=float)
w = np.zeros(Xb.shape[1], dtype=float)
history = {'iter': [], 'loss': [], 'jaccard@0.5': []}
for t in range(n_iter):
z = Xb @ w
p = sigmoid(z)
if loss == 'log':
L = log_loss(y, p) + 0.5 * l2 * np.sum(w[1:] ** 2)
grad = Xb.T @ (p - y) / len(y)
grad[1:] += l2 * w[1:]
elif loss == 'soft_jaccard':
L = soft_jaccard_loss(y, p) + 0.5 * l2 * np.sum(w[1:] ** 2)
dLdp = soft_jaccard_grad_p(y, p)
dLdz = dLdp * p * (1 - p)
grad = Xb.T @ dLdz / len(y)
grad[1:] += l2 * w[1:]
else:
raise ValueError("loss must be 'log' or 'soft_jaccard'")
w -= lr * grad
if (t % record_every) == 0 or t == (n_iter - 1):
y_hat = (p >= 0.5).astype(int)
j = jaccard_score_binary(y.astype(int), y_hat, zero_division=0.0)
history['iter'].append(t)
history['loss'].append(float(L))
history['jaccard@0.5'].append(float(j))
return w, history
def best_threshold_for_jaccard(y_true, p, thresholds):
scores = np.array(
[jaccard_score_binary(y_true, (p >= t).astype(int), zero_division=0.0) for t in thresholds], dtype=float
)
best_idx = int(scores.argmax())
return float(thresholds[best_idx]), float(scores[best_idx]), scores
# Train two models:
# - standard logistic regression (log-loss)
# - logistic regression with a soft Jaccard loss
w_log, hist_log = fit_logreg_gd(
Xb_train,
y_train,
loss='log',
lr=0.2,
n_iter=400,
l2=0.01,
record_every=5,
)
w_iou, hist_iou = fit_logreg_gd(
Xb_train,
y_train,
loss='soft_jaccard',
lr=1.0,
n_iter=400,
l2=0.01,
record_every=5,
)
# Evaluate on test
p_test_log = sigmoid(Xb_test @ w_log)
p_test_iou = sigmoid(Xb_test @ w_iou)
j05_log = jaccard_score_binary(y_test, (p_test_log >= 0.5).astype(int), zero_division=0.0)
j05_iou = jaccard_score_binary(y_test, (p_test_iou >= 0.5).astype(int), zero_division=0.0)
ths = np.linspace(0.01, 0.99, 99)
best_t_log, best_j_log, curve_log = best_threshold_for_jaccard(y_test, p_test_log, ths)
best_t_iou, best_j_iou, curve_iou = best_threshold_for_jaccard(y_test, p_test_iou, ths)
summary = {
'log_loss': {'J@0.5': j05_log, 'best_t': best_t_log, 'best_J': best_j_log},
'soft_jaccard': {'J@0.5': j05_iou, 'best_t': best_t_iou, 'best_J': best_j_iou},
}
summary
{'log_loss': {'J@0.5': 0.6206896551724138,
'best_t': 0.38,
'best_J': 0.7666666666666667},
'soft_jaccard': {'J@0.5': 0.175,
'best_t': 0.51,
'best_J': 0.2857142857142857}}
# Training curves (loss)
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_log['iter'], y=hist_log['loss'], mode='lines', name='log-loss (train)'))
fig.add_trace(go.Scatter(x=hist_iou['iter'], y=hist_iou['loss'], mode='lines', name='soft Jaccard loss (train)'))
fig.update_layout(title='Training loss curves (different scales)', xaxis_title='iteration', yaxis_title='loss')
fig.show()
# Training curves (Jaccard at threshold 0.5)
fig = go.Figure()
fig.add_trace(
go.Scatter(x=hist_log['iter'], y=hist_log['jaccard@0.5'], mode='lines', name='log-loss model')
)
fig.add_trace(
go.Scatter(x=hist_iou['iter'], y=hist_iou['jaccard@0.5'], mode='lines', name='soft Jaccard model')
)
fig.update_layout(
title='Training: Jaccard@0.5 over iterations',
xaxis_title='iteration',
yaxis_title='Jaccard@0.5',
yaxis=dict(range=[0, 1]),
)
fig.show()
# Threshold tuning on test: maximize Jaccard
fig = go.Figure()
fig.add_trace(go.Scatter(x=ths, y=curve_log, mode='lines', name='log-loss model'))
fig.add_trace(go.Scatter(x=ths, y=curve_iou, mode='lines', name='soft Jaccard model'))
fig.add_vline(x=best_t_log, line_dash='dash', line_color='black')
fig.add_vline(x=best_t_iou, line_dash='dash', line_color='gray')
fig.update_layout(
title='Test: Jaccard vs threshold (vertical lines = best thresholds)',
xaxis_title='threshold',
yaxis_title='Jaccard',
yaxis=dict(range=[0, 1]),
)
fig.show()
8) Pros, cons, and where Jaccard shines#
Pros#
Interpretable overlap measure in \([0,1]\).
Ignores true negatives, which is great when negatives are overwhelming (e.g. segmentation background).
Natural fit for sets, sparse binary features, multi-label problems.
Symmetric: \(J(A,B)=J(B,A)\).
Cons#
Because it ignores \(TN\), it can be misleading when correct negatives matter.
The “hard” metric is non-differentiable, so you usually optimize a surrogate.
For small objects in segmentation, a small boundary shift can drop IoU a lot.
For multiclass/multilabel, results depend heavily on the averaging choice (micro vs macro vs samples).
Good use cases#
Image segmentation / detection masks (IoU)
Multi-label classification (tags)
Information retrieval and matching (set overlap)
9) Pitfalls & diagnostics#
Union = 0 edge case: if both sets are empty, Jaccard is undefined (\(0/0\)). Decide a convention (zero_division).
Threshold choice: Jaccard can change a lot with the threshold; tune it on a validation set.
Averaging:
micro favors frequent labels/classes
macro treats each label/class equally (often better for rare labels)
samples answers: “how good are we per example?” (multilabel)
Compare with precision/recall to see whether low IoU comes from extra positives (FP) or missed positives (FN).
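The union = 0 edge case is easy to demo with a minimal helper (a re-implementation just for this sketch, mirroring the zero_division convention used throughout):

```python
import numpy as np

def jaccard_binary(y_true, y_pred, zero_division=0.0):
    # minimal binary Jaccard, for demoing the 0/0 convention
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    union = np.logical_or(y_true, y_pred).sum()
    if union == 0:
        return float(zero_division)  # both sets empty: 0/0
    return float(np.logical_and(y_true, y_pred).sum() / union)

empty_true = np.zeros(5, dtype=int)
empty_pred = np.zeros(5, dtype=int)

jaccard_binary(empty_true, empty_pred, zero_division=0.0)  # 0.0
jaccard_binary(empty_true, empty_pred, zero_division=1.0)  # 1.0
```

For segmentation, zero_division=1.0 is common (predicting "no object" for an image with no object is a perfect match); for classification metrics, 0.0 is the usual conservative default.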
10) Exercises#
Create two predictions with the same accuracy but very different Jaccard. Explain the difference using \(TP/FP/FN/TN\).
For multilabel data, build a case where micro is high but macro is low. What does that imply?
Implement a soft Dice (F1) loss and compare its behavior to soft Jaccard on the same toy dataset.
References#
scikit-learn jaccard_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html
IoU / Jaccard index (overview): https://en.wikipedia.org/wiki/Jaccard_index